We have d arms. For example, arms are ads that we display to users each time they connect to a web page.
Each time a user connects to this web page, that makes a round.
At each round n, we choose one ad to display to the user.
At each round n, ad i gives reward ri(n) ∈ {0, 1}: ri(n) = 1 if the user clicked on the ad i, 0 if the user didn’t.
Our goal is to maximize the total reward we get over many rounds.
Step 1: At each round n, we consider two numbers for each ad i: * Ni(n) - the number of times the ad i was selected up to round n, * Ri(n) - the sum of rewards the ad i up to round n.
Step 2: From these two numbers we compute:
the average reward of ad i up to round n
ri(n) = Ri(n) / Ni(n)
the confidence interval [ri(n) - Δi(n), ri(n) + Δi(n)]
at round n with
Δi(n) = √(3 / 2) * (log(n) / Ni(n))
Step 3: We select the ad i that has the maximum UCB ri(n) + Δi(n)
.
«Previous | Next» |